HVT Scoring Cells with Layers using scoreLayeredHVT

Zubin Dowlaty, Srinivasan Sudarsanam, Somya Shambhawi, Vishwavani

2024-05-07

1. Abstract

The HVT package is a collection of R functions that facilitates building topology preserving maps for rich multivariate data analysis, particularly for datasets tending towards big data, i.e., a large number of rows. The functions for this typical workflow are organized below:

  1. Data Compression: Vector quantization (VQ), HVQ (hierarchical vector quantization) using means or medians. This step compresses the rows (long data frame) using a compression objective.

  2. Data Projection: Dimension projection of the compressed cells to a 1D, 2D, or interactive surface plot with Sammon's non-linear projection algorithm. This step creates topology preserving map coordinates (also called an embedding) in the desired output dimension.

  3. Tessellation: Create the cells required for object visualization using the Voronoi tessellation method; the package includes heatmap plots for hierarchical Voronoi tessellations (HVT). This step enables data insights, visualization, and interaction with the topology preserving map, and is useful for semi-supervised tasks.

  4. Scoring: Score new datasets and record their assignments using the map objects from the above steps, in a sequence of maps if required.
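As a rough sketch of the compression step above, base R's kmeans can stand in for the package's vector quantization routine (the HVT functions add the hierarchy, the quantization error threshold, and diagnostics on top of this idea); the toy data and cell count here are illustrative only:

```r
# A minimal stand-in for the data compression step: k-means vector
# quantization collapses 10,000 rows into 50 representative cells.
set.seed(1)
long_df <- data.frame(x = rnorm(10000), y = rnorm(10000))
vq <- kmeans(long_df, centers = 50, nstart = 5)

codebook    <- vq$centers   # one centroid (codeword) per cell
assignments <- vq$cluster   # cell membership for every original row
nrow(codebook)              # 50 rows retained instead of 10,000
```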

2. Import Code Modules

Here is the guide to installing the HVT package; it helps the user install the most recent version of the package.

###direct installation###
#install.packages("HVT")

#or

###git repo installation###
#library(devtools)
#devtools::install_github(repo = "Mu-Sigma/HVT")

NOTE: At the time of documenting this vignette, the latest changes were not yet on CRAN; hence we source the scripts from the R folder directly into the session environment.

# Sourcing required code scripts for HVT
script_dir <- "../R"
r_files <- list.files(script_dir, pattern = "\\.R$", full.names = TRUE)
invisible(lapply(r_files, function(file) source(file, echo = FALSE)))

3. Example : HVT with the Torus dataset

In this section, we will see how to use the package to visualize multidimensional data by projecting it to two dimensions using Sammon's projection, and then use the resulting maps for scoring.
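As a simplified sketch of the projection step, MASS::sammon (an implementation of the Sammon algorithm named above) can embed compressed centroids into two dimensions; the toy data and cell count here are illustrative only:

```r
library(MASS)  # provides sammon()

set.seed(1)
pts <- matrix(rnorm(300 * 3), ncol = 3)        # toy 3-D data
centers <- kmeans(pts, centers = 40)$centers   # compress first
proj <- sammon(dist(centers), k = 2, trace = FALSE)
dim(proj$points)                               # 40 x 2 embedding coordinates
```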

Data Understanding

First, let us see how to generate the torus data. We use the geozoo library for this purpose. Geo Zoo (short for Geometric Zoo) is a compilation of geometric objects ranging from three to ten dimensions. Geo Zoo contains regular or well-known objects, e.g., the cube and sphere, and some abstract objects, e.g., Boy's surface, the torus and the hyper-torus.

Here, we will generate a 3D torus (a torus is a surface of revolution generated by revolving a circle in three-dimensional space one full revolution about an axis that is coplanar with the circle) with 12000 points.
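If geozoo is unavailable, points on the same surface can be sampled directly from the parametric equations of a torus. The radii R = 2 and r = 1 are an assumption chosen to match the coordinate ranges shown below, and drawing the angles uniformly is a simplification (geozoo samples the surface more evenly):

```r
# Parametric torus:
#   x = (R + r*cos(v)) * cos(u),  y = (R + r*cos(v)) * sin(u),  z = r*sin(v)
set.seed(240)
n <- 12000
R <- 2; r <- 1                  # major and minor radii (assumed)
u <- runif(n, 0, 2 * pi)        # angle around the central axis
v <- runif(n, 0, 2 * pi)        # angle around the tube
torus_manual <- data.frame(
  x = (R + r * cos(v)) * cos(u),
  y = (R + r * cos(v)) * sin(u),
  z = r * sin(v)
)
range(torus_manual$z)           # stays within [-1, 1]
```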

Raw Torus Dataset

The torus dataset contains three columns: x, y, and z.

Let's explore the torus dataset containing 12,000 points. For the sake of brevity, we display only the first 6 rows.

set.seed(240)
# Here p represents dimension of object, n represents number of points
torus <- geozoo::torus(p = 3,n = 12000)
torus_df <- data.frame(torus$points)
colnames(torus_df) <- c("x","y","z")
torus_df <- torus_df %>% round(4)
Table(head(torus_df), scroll = FALSE)
x y z
-2.6282 0.5656 -0.7253
-1.4179 -0.8903 0.9455
-1.0308 1.1066 -0.8731
1.8847 0.1895 0.9944
-1.9506 -2.2507 0.2071
-1.4824 0.9229 0.9672

Now let's have a look at the structure of the torus dataset.

str(torus_df)
#> 'data.frame':    12000 obs. of  3 variables:
#>  $ x: num  -2.63 -1.42 -1.03 1.88 -1.95 ...
#>  $ y: num  0.566 -0.89 1.107 0.19 -2.251 ...
#>  $ z: num  -0.725 0.946 -0.873 0.994 0.207 ...

Data distribution

This section displays four objects.

  1. Variable Histograms: The histogram distribution of all the variables in the dataset.

  2. Box Plots: Box plots for each numeric column in the dataset across panels. These plots display the median and interquartile range of each column at a panel level.

  3. Correlation Matrix: This calculates the Pearson correlation, a bivariate measure of the linear correlation between two numeric columns. The output is shown as a matrix.

  4. Summary EDA: The table provides descriptive statistics for all the variables in the dataset.

The inbuilt edaPlots function displays the four objects mentioned above.

edaPlots(torus_df)
variable min 1st Quartile median mean sd 3rd Quartile max hist n_row n_missing
x -2.9977 -1.1490 -0.0070 -0.0014 1.5060 1.1403 2.9995 ▅▇▇▇▅ 12000 0
y -2.9993 -1.1133 0.0130 0.0103 1.4856 1.1337 2.9993 ▃▇▇▇▅ 12000 0
z -1.0000 -0.7120 0.0153 0.0044 0.7118 0.7186 1.0000 ▇▃▃▃▇ 12000 0

Train - Test Split

Let us split the torus dataset into train and test sets. We will randomly select 80% of the torus dataset as train and the remaining 20% as test.

smp_size <- floor(0.80 * nrow(torus_df))
set.seed(279)
train_ind <- sample(seq_len(nrow(torus_df)), size = smp_size)
torus_train <- torus_df[train_ind, ]
torus_test <- torus_df[-train_ind, ]

Training Dataset

Now, let's have a look at the selected training dataset containing 9,600 data points. For the sake of brevity, we display the first six rows.

rownames(torus_train) <- NULL
Table(head(torus_train), scroll= FALSE)
x y z
1.7958 -0.4204 -0.9878
0.7115 -2.3528 -0.8889
1.9285 1.2034 0.9620
1.0175 0.0344 -0.1894
-0.2736 1.1298 -0.5464
1.8976 2.2391 0.3545

Now let's have a look at the structure of the training dataset.

str(torus_train)
#> 'data.frame':    9600 obs. of  3 variables:
#>  $ x: num  1.796 0.712 1.929 1.018 -0.274 ...
#>  $ y: num  -0.4204 -2.3528 1.2034 0.0344 1.1298 ...
#>  $ z: num  -0.988 -0.889 0.962 -0.189 -0.546 ...

Data Distribution

edaPlots(torus_train)
variable min 1st Quartile median mean sd 3rd Quartile max hist n_row n_missing
x -2.9973 -1.1514 -0.0102 -0.0055 1.5057 1.1254 2.9995 ▅▇▇▇▅ 9600 0
y -2.9993 -1.1078 0.0209 0.0163 1.4832 1.1377 2.9993 ▃▇▇▇▅ 9600 0
z -1.0000 -0.7067 0.0147 0.0046 0.7100 0.7168 1.0000 ▇▃▃▃▇ 9600 0

Testing Dataset

Now, let's have a look at the testing dataset containing 2,400 data points. For the sake of brevity, we display the first six rows.

rownames(torus_test) <- NULL
Table(head(torus_test), scroll = FALSE)
x y z
-2.6282 0.5656 -0.7253
2.7471 -0.9987 -0.3848
-2.4446 -1.6528 0.3097
-2.6487 -0.5745 0.7040
-0.2676 -1.0800 -0.4611
-1.1130 -0.6516 -0.7040

Now let's have a look at the structure of the testing dataset.

str(torus_test)
#> 'data.frame':    2400 obs. of  3 variables:
#>  $ x: num  -2.628 2.747 -2.445 -2.649 -0.268 ...
#>  $ y: num  0.566 -0.999 -1.653 -0.575 -1.08 ...
#>  $ z: num  -0.725 -0.385 0.31 0.704 -0.461 ...

Data Distribution

edaPlots(torus_test)
variable min 1st Quartile median mean sd 3rd Quartile max hist n_row n_missing
x -2.9977 -1.1310 0.0001 0.0147 1.5072 1.1934 2.9908 ▅▇▇▇▅ 2400 0
y -2.9918 -1.1314 -0.0001 -0.0133 1.4951 1.1118 2.9861 ▃▇▇▇▅ 2400 0
z -1.0000 -0.7337 0.0157 0.0036 0.7192 0.7311 1.0000 ▇▃▃▃▇ 2400 0

4. Map A : Base Compressed Map

Let us try to visualize the compressed Map A from the diagram below.

Figure 1: Data Segregation with highlighted bounding box in red around compressed map A


This package can perform vector quantization using the following algorithms: k-means and k-medoids.

For more information on vector quantization, refer to the following link.

The trainHVT function constructs highly compressed hierarchical Voronoi tessellations. The raw data is first scaled, and the scaled data is supplied as input to the vector quantization algorithm. The algorithm compresses the dataset until a user-defined compression percentage is achieved, using a parameter called the quantization error, which acts as a threshold that determines the compression percentage. In other words, for a given user-defined compression percentage we obtain 'n' cells, and that percentage of the cells will have a quantization error below the threshold quantization error.
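The threshold logic can be sketched in base R: compress with k-means, compute a per-cell error (here the maximum L1 distance of a cell's members from its centroid, mirroring the distance_metric and error_metric choices used later), and report the fraction of cells below the threshold. The data and threshold are illustrative:

```r
set.seed(279)
dat <- matrix(rnorm(2000 * 3), ncol = 3)
km <- kmeans(dat, centers = 100, nstart = 3)

# maximum L1 distance of each cell's members from the cell centroid
cell_err <- sapply(seq_len(100), function(k) {
  members <- dat[km$cluster == k, , drop = FALSE]
  max(rowSums(abs(sweep(members, 2, km$centers[k, ]))))
})

quant_err <- 1.5             # illustrative threshold
mean(cell_err < quant_err)   # fraction of cells below the threshold
```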

Let’s try to comprehend the trainHVT function first before moving ahead.

trainHVT(
  dataset,
  min_compression_perc,
  n_cells,
  depth,
  quant.err,
  distance_metric = c("L1_Norm", "L2_Norm"),
  error_metric = c("mean", "max"),
  quant_method = c("kmeans", "kmedoids"),
  normalize = TRUE,
  seed = 279,
  diagnose = FALSE,
  hvt_validation = FALSE,
  train_validation_split_ratio = 0.8
)

Each of the parameters of the trainHVT function is explained below:

The output of the trainHVT function (a list of 7 elements) is explained below, with an image attached for clarity.

NOTE: The attached image is a snapshot of the output list generated from map A, which can be referred to later in this section.

Figure 2: The Output list generated by trainHVT function.


We will use the trainHVT function to compress our data while preserving the essential features of the dataset. Our goal is to achieve at least 80% data compression. In situations where the compression ratio does not meet the desired target, we can adjust the model parameters, e.g., by modifying the quantization error threshold or increasing the number of cells, and then rerun the trainHVT function.

This is covered in detail in the HVT vignette; please refer to it for more information.

Model Parameters

set.seed(240)
torus_mapA <- trainHVT(
  torus_train,
  n_cells = 900,
  depth = 1,
  quant.err = 0.1,
  normalize = FALSE,
  distance_metric = "L1_Norm",
  error_metric = "max",
  quant_method = "kmeans"
)

Let’s check the compression summary for torus.

displayTable(data = torus_mapA[[3]]$compression_summary, columnName = 'percentOfCellsBelowQuantizationErrorThreshold', value = 0.8, tableType = "compression")
segmentLevel noOfCells noOfCellsBelowQuantizationError percentOfCellsBelowQuantizationErrorThreshold parameters
1 900 749 0.83 n_cells: 900 quant.err: 0.1 distance_metric: L1_Norm error_metric: max quant_method: kmeans

We successfully compressed 83% of the data with the n_cells parameter set to 900. The next step is data projection: the compressed data is transformed and projected onto a lower-dimensional space so that it can be visualized and analyzed in a more manageable form.

As per the manual, torus_mapA[[3]] gives detailed information about the hierarchical vector quantized data, and torus_mapA[[3]][['summary']] gives a tabular summary containing the number of points, the quantization error, and the codebook for each cell.

The datatable displayed below is the summary from torus_mapA showing Cell.ID, Centroids and Quantization Error for each of the 900 cells. For the sake of brevity, we are displaying only the first 100 rows.

displayTable(data = torus_mapA[[3]][['summary']], columnName = 'Quant.Error', value = 0.1, tableType = "summary")
Segment.Level Segment.Parent Segment.Child n Cell.ID Quant.Error x y z
1 1 1 11 82 0.07 -0.99 -2.14 0.93
1 1 2 8 164 0.07 1.91 -1.80 0.78
1 1 3 11 392 0.05 1.03 -0.88 -0.76
1 1 4 6 126 0.04 0.21 -2.08 0.99
1 1 5 6 786 0.05 0.53 1.91 -1.00
1 1 6 12 831 0.08 1.70 1.77 0.89
1 1 7 10 460 0.06 -0.96 0.30 0.11
1 1 8 7 399 0.05 0.72 -0.69 0.04
1 1 9 14 661 0.09 2.01 0.31 -1.00
1 1 10 7 169 0.07 -1.53 -1.31 -1.00
1 1 11 8 544 0.06 -1.29 0.93 -0.91
1 1 12 7 48 0.07 -1.84 -2.04 -0.67
1 1 13 10 698 0.07 0.60 1.40 0.88
1 1 14 10 3 0.12 -1.33 -2.64 0.25
1 1 15 13 416 0.05 -1.05 -0.01 0.32
1 1 16 7 442 0.06 2.41 -0.73 0.85
1 1 17 10 704 0.08 0.22 1.46 -0.85
1 1 18 7 238 0.05 -1.54 -0.81 -0.96
1 1 19 14 320 0.06 -0.56 -0.83 0.02
1 1 20 6 306 0.06 -0.90 -0.77 -0.58
1 1 21 11 124 0.12 2.21 -2.00 -0.15
1 1 22 8 526 0.05 0.98 0.23 -0.05
1 1 23 5 877 0.11 0.71 2.84 0.36
1 1 24 5 853 0.05 1.15 2.31 -0.81
1 1 25 10 550 0.08 2.05 -0.13 -0.99
1 1 26 7 892 0.11 2.34 1.86 -0.09
1 1 27 5 682 0.07 -2.47 1.56 0.37
1 1 28 11 752 0.07 -0.26 1.88 -0.99
1 1 29 10 12 0.1 -0.40 -2.94 0.23
1 1 30 11 308 0.09 -2.99 0.24 -0.04
1 1 31 14 734 0.09 -1.99 1.90 0.65
1 1 32 8 652 0.08 -0.43 1.38 0.84
1 1 33 12 688 0.08 -1.06 1.66 1.00
1 1 34 5 795 0.05 -0.61 2.35 -0.90
1 1 35 8 282 0.06 -0.65 -0.92 0.49
1 1 36 9 44 0.08 0.08 -2.79 -0.61
1 1 37 10 558 0.08 -0.72 0.96 -0.61
1 1 38 4 85 0.04 0.35 -2.48 0.86
1 1 39 13 673 0.06 1.07 0.98 -0.83
1 1 40 5 609 0.04 1.17 0.61 0.73
1 1 41 6 743 0.08 -2.32 1.89 0.05
1 1 42 12 779 0.11 1.96 1.14 0.96
1 1 43 10 726 0.06 -0.07 1.74 0.97
1 1 44 6 65 0.05 0.54 -2.66 0.70
1 1 45 15 833 0.1 1.81 1.66 -0.88
1 1 46 11 589 0.07 0.68 0.77 0.25
1 1 47 12 240 0.1 -1.60 -0.67 0.96
1 1 48 14 705 0.07 1.74 0.76 0.99
1 1 49 8 557 0.05 0.86 0.52 -0.11
1 1 50 10 230 0.05 -1.93 -0.53 1.00
1 1 51 9 227 0.06 -0.23 -1.35 0.78
1 1 52 12 127 0.1 -1.89 -1.27 0.96
1 1 53 15 464 0.07 -1.00 0.35 0.32
1 1 54 10 865 0.09 1.23 2.42 0.69
1 1 55 13 121 0.08 -2.27 -1.07 0.86
1 1 56 8 357 0.08 -2.06 0.09 1.00
1 1 57 7 769 0.06 -0.75 2.15 -0.96
1 1 58 13 425 0.1 -2.69 0.55 0.66
1 1 59 11 556 0.04 0.85 0.53 0.09
1 1 60 10 100 0.13 -2.52 -1.05 0.67
1 1 61 10 811 0.1 -0.87 2.58 -0.68
1 1 62 8 135 0.05 0.56 -2.08 0.99
1 1 63 11 580 0.08 -0.61 1.09 0.65
1 1 64 7 610 0.06 -1.75 1.25 -0.99
1 1 65 8 86 0.09 -0.46 -2.34 -0.92
1 1 66 12 250 0.08 -1.97 -0.45 -1.00
1 1 67 9 368 0.06 -1.45 -0.20 -0.84
1 1 68 11 163 0.09 1.19 -2.08 -0.91
1 1 69 7 448 0.1 -2.90 0.67 0.15
1 1 70 12 615 0.12 1.47 0.43 0.88
1 1 71 8 280 0.05 -0.34 -1.09 -0.51
1 1 72 14 420 0.08 0.95 -0.66 -0.54
1 1 73 14 336 0.07 -0.77 -0.68 0.23
1 1 74 20 402 0.1 1.39 -0.74 0.90
1 1 75 7 611 0.05 0.24 1.02 0.31
1 1 76 17 438 0.06 -1.04 0.15 0.31
1 1 77 14 645 0.05 0.28 1.13 -0.55
1 1 78 10 751 0.09 2.87 0.29 -0.45
1 1 79 8 157 0.09 1.47 -1.93 0.90
1 1 80 7 875 0.05 1.26 2.49 -0.61
1 1 81 16 575 0.06 -0.33 0.97 -0.23
1 1 82 5 84 0.05 1.81 -2.36 0.18
1 1 83 20 572 0.08 -0.45 1.01 0.45
1 1 84 8 640 0.06 0.01 1.24 0.65
1 1 85 15 112 0.13 1.85 -2.19 -0.49
1 1 86 12 224 0.07 -0.33 -1.42 -0.84
1 1 87 14 35 0.12 -1.04 -2.56 -0.63
1 1 88 14 476 0.08 -1.06 0.47 0.54
1 1 89 13 574 0.08 -0.30 0.98 0.22
1 1 90 14 47 0.1 0.18 -2.76 0.63
1 1 91 10 517 0.05 1.01 0.15 0.19
1 1 92 17 71 0.13 -1.33 -2.15 -0.84
1 1 93 18 286 0.08 -2.06 -0.24 -0.99
1 1 94 6 894 0.06 2.08 2.08 -0.32
1 1 95 8 807 0.13 -1.11 2.58 0.58
1 1 96 11 29 0.11 -0.11 -2.87 -0.47
1 1 97 4 719 0.04 -2.03 1.81 -0.69
1 1 98 9 889 0.11 2.21 1.95 0.30
1 1 99 7 22 0.07 0.06 -2.99 0.02
1 1 100 6 186 0.05 -0.53 -1.52 0.92

Now let us understand what each column in the above table means:

The columns after Quant.Error contain the centroid coordinates of each cell. Collectively, these centroids form the codebook: a collection of all centroids, or codewords.
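In k-means terms, the codebook is simply the matrix of cluster centers; the HVT summary stores the same information in its feature columns. A self-contained sketch on toy data:

```r
set.seed(240)
pts <- matrix(runif(900 * 3), ncol = 3)
colnames(pts) <- c("x", "y", "z")

km <- kmeans(pts, centers = 25)
codebook <- km$centers   # 25 codewords: one centroid per cell
dim(codebook)            # 25 x 3
```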

Now let's try to understand the plotHVT function. The parameters are explained in detail below:

plotHVT(
  hvt.results,
  line.width,
  color.vec,
  pch1 = 21,
  centroid.size = 1.5,
  title = NULL,
  maxDepth = NULL,
  child.level,
  hmap.cols,
  quant.error.hmap = NULL,
  n_cells.hmap = NULL,
  label.size = 0.5,
  sepration_width = 7,
  layer_opacity = c(0.5, 0.75, 0.99),
  dim_size = 1000,
  plot.type = '2Dhvt'
)

Let’s plot the Voronoi tessellation for layer 1 (map A).

plotHVT(torus_mapA,
        line.width = c(0.4), 
        color.vec = c("navy blue"),
        centroid.size = 0.01,
        maxDepth = 1,
        plot.type = "2Dhvt") 
Figure 3: The Voronoi Tessellation for layer 1 (map A) shown for the 900 cells in the dataset ’torus’


4.1 Heatmaps

Now let’s plot the Voronoi Tessellation with the heatmap overlaid for all the features in the torus dataset for better visualization and interpretation of data patterns and distributions.

The heatmaps displayed below provide a visual representation of the spatial characteristics of the torus dataset, allowing us to observe patterns and trends in the distribution of each of the features (x, y, z). The green shades highlight regions with higher values in each heatmap, while the indigo shades indicate areas with the lowest values. By analyzing these heatmaps, we can gain insights into the variations in, and relationships between, these features within the torus dataset.


  plotHVT(
  torus_mapA,
  child.level = 1,
  hmap.cols = "x",
  line.width = c(0.2),
  color.vec = c("navy blue"),
  centroid.size = 0.1,
  plot.type = '2Dheatmap'
) 
Figure 4: The Voronoi Tessellation with the heat map overlaid for variable ’x’ in the ’torus’ dataset



  plotHVT(
  torus_mapA,
  child.level = 1,
  hmap.cols = "y",
  line.width = c(0.2),
  color.vec = c("navy blue"),
  centroid.size = 0.1,
  plot.type = '2Dheatmap'
) 
Figure 5: The Voronoi Tessellation with the heat map overlaid for variable ’y’ in the ’torus’ dataset



  plotHVT(
  torus_mapA,
  child.level = 1,
  hmap.cols = "z",
  line.width = c(0.2),
  color.vec = c("navy blue"),
  centroid.size = 0.1,
  plot.type = '2Dheatmap'
) 
Figure 6: The Voronoi Tessellation with the heat map overlaid for variable ’z’ in the ’torus’ dataset


5. Map B : Compressed Novelty Map

Let us try to visualize the Map B from the diagram below.

Figure 7: Data Segregation with highlighted bounding box in red around map B


In this section, we will manually identify the novelty cells from the plotted torus_mapA and store them in the identified_Novelty_cells variable.

Note: To manually select the novelty cells from map A, one can enhance its interactivity by adding plotly elements to the code. This transforms map A into an interactive plot, allowing users to actively engage with the data: hovering over the centroid of a cell displays a tag containing its segment child information. Users can explore the map by hovering over different cells and selectively choose the novelty cells they wish to consider. An image is attached for reference.

Figure 8: Manually selecting novelty cells


The removeNovelty function removes the identified novelty cell(s) from the training dataset (containing 9,600 data points) and stores those records separately.

It takes as input the cell numbers (Segment.Child) of the manually identified novelty cell(s) and the compressed HVT map (torus_mapA) with 900 cells. It returns a list of two items: the data with novelty and the data without novelty.
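The split performed by removeNovelty can be sketched as subsetting rows by their cell assignment; this is a simplified stand-in on toy data, with hypothetical cell numbers:

```r
set.seed(1)
dat <- data.frame(x = rnorm(500), y = rnorm(500))
km <- kmeans(dat, centers = 20)
dat$cell <- km$cluster                      # cell membership per row

novelty_cells <- c(3, 7, 11)                # hypothetical novelty cell numbers
data_with_novelty    <- dat[dat$cell %in% novelty_cells, ]
data_without_novelty <- dat[!dat$cell %in% novelty_cells, ]

nrow(data_with_novelty) + nrow(data_without_novelty)   # 500: every row kept once
```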

NOTE: As we are using the torus dataset here, the identified novelty cells are given for demonstration purposes.

identified_Novelty_cells <- c(595, 425, 165, 875, 822, 697, 166)   # as an example
output_list <- removeNovelty(identified_Novelty_cells, torus_mapA)
data_with_novelty <- output_list[[1]]
data_without_novelty <- output_list[[2]]

Let's have a look at the data with novelty (containing 83 records).

novelty_data <- data_with_novelty
novelty_data$Row.No <- row.names(novelty_data)
novelty_data <- novelty_data %>% dplyr::select("Row.No","Cell.ID","Cell.Number","x","y","z")
colnames(novelty_data) <- c("Row.No","Cell.ID","Segment.Child","x","y","z")
Table(novelty_data,scroll = TRUE, limit = 100)
Row.No Cell.ID Segment.Child x y z
1 64 165 1.3324 -2.6588 -0.2268
2 64 165 1.5139 -2.5820 0.1175
3 64 165 1.4286 -2.6364 -0.0536
4 64 165 1.4241 -2.6117 -0.2235
5 64 165 1.4838 -2.6035 0.0825
6 64 165 1.6572 -2.5004 -0.0235
7 64 165 1.3774 -2.6605 0.0900
8 64 165 1.3506 -2.6740 -0.0925
9 64 165 1.4585 -2.5935 -0.2201
10 64 165 1.4487 -2.6202 -0.1093
11 64 165 1.3985 -2.6531 -0.0424
12 64 165 1.4575 -2.5965 -0.2104
13 64 165 1.5355 -2.5772 0.0102
14 64 165 1.5240 -2.5769 -0.1107
15 819 166 -1.4562 2.5939 -0.2235
16 819 166 -1.4989 2.5819 -0.1701
17 819 166 -1.3234 2.6788 -0.1553
18 819 166 -1.3064 2.6859 -0.1623
19 819 166 -1.4328 2.6352 0.0303
20 819 166 -1.4572 2.6197 0.0670
21 819 166 -1.6413 2.5055 -0.0980
22 819 166 -1.5168 2.5865 0.0552
23 819 166 -1.5143 2.5889 -0.0375
24 819 166 -1.3334 2.6861 -0.0478
25 819 166 -1.3388 2.6804 -0.0876
26 819 166 -1.4489 2.6259 -0.0421
27 839 425 2.8302 0.9714 0.1239
28 839 425 2.7954 1.0744 0.1024
29 839 425 2.7172 1.2425 0.1556
30 839 425 2.7929 1.0780 -0.1123
31 839 425 2.8059 1.0436 0.1121
32 839 425 2.8309 0.9756 -0.1071
33 839 425 2.7778 1.1314 0.0350
34 839 425 2.7531 1.1907 0.0281
35 839 425 2.8049 1.0642 -0.0066
36 839 425 2.8666 0.8810 0.0472
37 839 425 2.7736 1.1191 0.1346
38 839 425 2.8352 0.9279 0.1828
39 839 425 2.7399 1.1841 0.1734
40 839 425 2.8198 1.0163 0.0732
41 839 425 2.8123 1.0309 0.0965
42 800 595 2.9378 0.6073 -0.0134
43 800 595 2.9338 0.5885 0.1242
44 800 595 2.9436 0.5597 0.0852
45 800 595 2.8827 0.7950 0.1391
46 800 595 2.9308 0.5937 0.1390
47 800 595 2.9066 0.7312 0.0755
48 800 595 2.9099 0.5923 0.2447
49 800 595 2.9383 0.6049 0.0126
50 800 595 2.9290 0.6330 -0.0821
51 800 595 2.9056 0.5665 0.2788
52 800 595 2.9156 0.7037 -0.0376
53 39 697 -2.5955 -1.4823 0.1483
54 39 697 -2.6052 -1.4771 0.1019
55 39 697 -2.5256 -1.6157 0.0609
56 39 697 -2.6147 -1.4575 0.1138
57 39 697 -2.5697 -1.5461 0.0466
58 39 697 -2.5655 -1.5213 0.1855
59 39 697 -2.5258 -1.5891 0.1775
60 39 697 -2.4889 -1.6570 0.1411
61 39 697 -2.5673 -1.5518 -0.0157
62 79 822 -2.7984 -1.0772 0.0543
63 79 822 -2.7636 -1.1210 -0.1873
64 79 822 -2.7651 -1.1357 -0.1463
65 79 822 -2.7255 -1.2424 -0.0963
66 79 822 -2.8309 -0.9764 0.1042
67 79 822 -2.8204 -0.9948 -0.1359
68 79 822 -2.7876 -1.1062 0.0432
69 79 822 -2.7586 -1.1768 -0.0423
70 79 822 -2.7917 -1.0899 -0.0787
71 36 875 0.7839 -2.8918 0.0879
72 36 875 0.4498 -2.9486 0.1852
73 36 875 0.5570 -2.9287 0.1932
74 36 875 0.6597 -2.9266 0.0029
75 36 875 0.7650 -2.9007 0.0126
76 36 875 0.6418 -2.9220 0.1290
77 36 875 0.6082 -2.9195 0.1879
78 36 875 0.6415 -2.8840 0.2982
79 36 875 0.4248 -2.9574 0.1560
80 36 875 0.6271 -2.9079 0.2233
81 36 875 0.4684 -2.9614 0.0604
82 36 875 0.6822 -2.8930 0.2337
83 36 875 0.5300 -2.9477 0.1004

5.1 Voronoi Tessellation with highlighted novelty cell

The plotNovelCells function plots the Voronoi tessellation using the compressed HVT map (torus_mapA) containing 900 cells and highlights the identified novelty cells, i.e., the 7 cells (containing 83 records), in red on the map.

plotNovelCells(identified_Novelty_cells, torus_mapA,line.width = c(0.4),centroid.size = 0.01)
Figure 9: The Voronoi Tessellation constructed using the compressed HVT map (map A) with the novelty cell(s) highlighted in red


We pass the dataframe with the novelty records (83 records) to the trainHVT function, along with the other model parameters mentioned below, to generate map B (layer 2).

Model Parameters

colnames(data_with_novelty) <- c("Cell.ID","Segment.Child","x","y","z")
data_with_novelty <- data_with_novelty[,-1:-2]
torus_mapB <- list()
mapA_scale_summary <- torus_mapA[[3]]$scale_summary
torus_mapB <- trainHVT(data_with_novelty,
                  n_cells = 11,   
                  depth = 1,
                  quant.err = 0.1,
                  normalize = FALSE,
                  distance_metric = "L1_Norm",
                  error_metric = "max",
                  quant_method = "kmeans"
                  )

The datatable displayed below is the summary from map B (layer 2) showing Cell.ID, Centroids and Quantization Error for each of the 11 cells.

displayTable(data = torus_mapB[[3]][['summary']], columnName = 'Quant.Error', value = 0.1, tableType = "summary")
Segment.Level Segment.Parent Segment.Child n Cell.ID Quant.Error x y z
1 1 1 8 9 0.08 0.68 -2.91 0.15
1 1 2 9 3 0.11 -2.78 -1.10 -0.05
1 1 3 5 5 0.05 2.75 1.17 0.11
1 1 4 7 1 0.1 -1.50 2.59 -0.03
1 1 5 10 6 0.09 2.82 1.01 0.05
1 1 6 9 4 0.07 -2.56 -1.54 0.11
1 1 7 5 8 0.05 0.49 -2.95 0.14
1 1 8 5 2 0.09 -1.35 2.67 -0.14
1 1 9 9 10 0.07 1.42 -2.62 -0.14
1 1 10 11 7 0.09 2.92 0.63 0.09
1 1 11 5 11 0.1 1.51 -2.58 0.06

Now let's check the compression summary for the HVT (torus_mapB). The table below shows the number of cells, the number of cells with quantization error below the threshold, and the percentage of such cells for each level.

displayTable(data = torus_mapB[[3]]$compression_summary, columnName = 'percentOfCellsBelowQuantizationErrorThreshold', value = 0.8, tableType = "compression")
segmentLevel noOfCells noOfCellsBelowQuantizationError percentOfCellsBelowQuantizationErrorThreshold parameters
1 11 9 0.82 n_cells: 11 quant.err: 0.1 distance_metric: L1_Norm error_metric: max quant_method: kmeans

As seen in the table above, 82% of the cells have a quantization error below the threshold. Since we successfully attained the desired compression percentage, we will not subdivide the cells further.

6. Map C : Compressed Map without Novelty

Let us try to visualize the compressed Map C from the diagram below.

Figure 10: Data Segregation with highlighted bounding box in red around compressed map C


6.1 Iteration 1

With the novelties removed, we construct another hierarchical Voronoi tessellation, map C (layer 2), on the data without novelty (containing 9,517 records), using the model parameters mentioned below.

Model Parameters

torus_mapC <- list()
torus_mapC <- trainHVT(data_without_novelty,
                  n_cells = 10,
                  depth = 2,
                  quant.err = 0.1,
                  normalize = FALSE,
                  distance_metric = "L1_Norm",
                  error_metric = "max",
                  quant_method = "kmeans")

Now let's check the compression summary for the HVT (torus_mapC), where n_cells was set to 10. The table below shows the number of cells, the number of cells with quantization error below the threshold, and the percentage of such cells for each level.

displayTable(data = torus_mapC[[3]]$compression_summary, columnName = 'percentOfCellsBelowQuantizationErrorThreshold', value = 0.8, tableType = "compression")
segmentLevel noOfCells noOfCellsBelowQuantizationError percentOfCellsBelowQuantizationErrorThreshold parameters
1 10 0 0 n_cells: 10 quant.err: 0.1 distance_metric: L1_Norm error_metric: max quant_method: kmeans
2 100 0 0 n_cells: 10 quant.err: 0.1 distance_metric: L1_Norm error_metric: max quant_method: kmeans

As seen in the table above, none of the cells have a quantization error below the threshold at either level 1 or level 2.

6.2 Iteration 2

Since we have yet to achieve at least 80% compression at depth 2, let's try to compress again using the model parameters mentioned below and the data without novelty (containing 9,517 records).

Model Parameters

torus_mapC <- list()
torus_mapC <- trainHVT(data_without_novelty,
                  n_cells = 30,    
                  depth = 2,
                  quant.err = 0.1,
                  normalize = FALSE,
                  distance_metric = "L1_Norm",
                  error_metric = "max",
                  quant_method = "kmeans")

The datatable displayed below is the summary from map C (layer 2), showing Cell.ID, centroids, and quantization error for each of the 928 cells. For the sake of brevity, we display only the first 100 rows.

displayTable(data = torus_mapC[[3]][['summary']], columnName = 'Quant.Error', value = 0.1, tableType = "summary")
Segment.Level Segment.Parent Segment.Child n Cell.ID Quant.Error x y z
1 1 1 426 613 0.5 0.84 -0.88 0.56
1 1 2 363 590 0.48 0.96 1.32 -0.85
1 1 3 321 236 0.6 -0.94 1.78 -0.88
1 1 4 311 628 0.67 0.48 -2.29 0.78
1 1 5 286 819 0.56 2.40 0.88 -0.62
1 1 6 424 584 0.49 0.92 0.51 0.26
1 1 7 269 727 0.58 1.50 2.39 0.05
1 1 8 261 55 0.54 -2.31 1.42 -0.46
1 1 9 269 886 0.51 2.68 -0.71 -0.34
1 1 10 250 122 0.54 -2.32 -1.17 0.62
1 1 11 358 261 0.45 -1.32 -0.25 0.67
1 1 12 373 685 0.47 1.34 -0.30 -0.72
1 1 13 399 430 0.44 -0.10 1.02 -0.20
1 1 14 350 244 0.59 -1.45 -1.39 -0.89
1 1 15 353 287 0.51 -0.97 0.85 0.62
1 1 16 282 386 0.43 -0.74 -1.55 0.88
1 1 17 253 89 0.55 -2.39 0.56 0.76
1 1 18 287 728 0.58 1.75 1.29 0.86
1 1 19 263 777 0.62 1.50 -1.82 -0.81
1 1 20 259 563 0.57 0.09 -2.61 -0.58
1 1 21 242 210 0.6 -1.46 -2.44 0.04
1 1 22 384 443 0.54 0.13 1.87 0.87
1 1 23 293 147 0.57 -1.41 2.13 0.66
1 1 24 302 807 0.6 2.24 -0.01 0.82
1 1 25 341 538 0.53 0.27 -1.23 -0.64
1 1 26 266 391 0.56 -0.03 2.71 -0.39
1 1 27 255 839 0.64 1.99 -1.75 0.55
1 1 28 378 406 0.48 -0.57 -0.83 -0.03
1 1 29 265 77 0.54 -2.58 -0.30 -0.61
1 1 30 434 250 0.5 -1.30 0.19 -0.67
2 1 1 18 626 0.12 1.01 -0.31 0.33
2 1 2 12 710 0.05 1.44 -0.76 0.93
2 1 3 13 573 0.07 0.62 -0.86 0.35
2 1 4 15 574 0.07 0.54 -1.25 0.77
2 1 5 13 612 0.08 0.87 -0.68 0.45
2 1 6 17 567 0.07 0.61 -0.80 0.07
2 1 7 12 701 0.08 1.42 -0.53 0.87
2 1 8 13 634 0.05 1.03 -0.49 0.51
2 1 9 11 539 0.08 0.29 -1.38 0.81
2 1 10 19 601 0.09 0.79 -0.90 0.59
2 1 11 17 609 0.07 0.90 -0.52 0.27
2 1 12 8 744 0.06 1.63 -0.87 0.99
2 1 13 11 671 0.09 1.30 -0.32 0.74
2 1 14 21 542 0.07 0.42 -0.92 0.15
2 1 15 13 591 0.08 0.80 -0.60 -0.02
2 1 16 19 650 0.09 0.99 -1.02 0.81
2 1 17 19 611 0.11 0.66 -1.41 0.89
2 1 18 16 674 0.11 1.04 -1.42 0.97
2 1 19 11 707 0.08 1.25 -1.27 0.97
2 1 20 13 646 0.07 1.06 -0.69 0.68
2 1 21 17 557 0.11 0.51 -1.02 0.51
2 1 22 7 589 0.04 0.75 -0.68 -0.15
2 1 23 21 525 0.1 0.25 -1.19 0.63
2 1 24 9 677 0.07 1.24 -0.80 0.85
2 1 25 11 726 0.06 1.45 -1.11 0.98
2 1 26 10 592 0.07 0.79 -0.65 0.22
2 1 27 17 623 0.08 0.80 -1.17 0.81
2 1 28 13 662 0.08 1.19 -0.44 0.68
2 1 29 20 524 0.09 0.24 -1.02 0.29
2 1 30 10 608 0.06 0.93 -0.36 0.07
2 2 1 8 667 0.06 1.34 0.81 -0.90
2 2 2 26 512 0.1 0.49 1.26 -0.75
2 2 3 10 653 0.09 1.24 1.35 -0.98
2 2 4 15 587 0.08 0.94 0.75 -0.60
2 2 5 7 621 0.06 1.08 0.57 -0.63
2 2 6 17 739 0.12 1.72 1.66 -0.92
2 2 7 5 541 0.05 0.70 2.20 -0.95
2 2 8 13 630 0.07 1.15 0.99 -0.87
2 2 9 9 665 0.06 1.33 1.09 -0.96
2 2 10 9 514 0.07 0.51 1.91 -1.00
2 2 11 12 599 0.07 1.00 1.56 -0.99
2 2 12 16 513 0.09 0.50 1.53 -0.92
2 2 13 11 669 0.1 1.34 0.65 -0.86
2 2 14 15 588 0.06 0.95 1.05 -0.81
2 2 15 15 529 0.11 0.61 0.97 -0.51
2 2 16 11 679 0.09 1.36 1.98 -0.91
2 2 17 13 614 0.1 1.01 1.85 -0.99
2 2 18 13 691 0.09 1.43 1.60 -0.98
2 2 19 17 553 0.1 0.77 0.87 -0.54
2 2 20 10 631 0.06 1.07 2.07 -0.94
2 2 21 7 615 0.05 0.98 2.27 -0.88
2 2 22 12 712 0.09 1.62 1.19 -0.99
2 2 23 14 555 0.06 0.79 1.42 -0.92
2 2 24 9 463 0.09 0.25 1.81 -0.98
2 2 25 10 705 0.1 1.56 0.80 -0.97
2 2 26 8 625 0.06 1.11 0.73 -0.74
2 2 27 9 467 0.08 0.22 1.47 -0.86
2 2 28 13 546 0.07 0.72 1.12 -0.74
2 2 29 16 581 0.09 0.90 1.25 -0.88
2 2 30 13 550 0.07 0.75 1.64 -0.98
2 3 1 13 196 0.11 -0.95 2.30 -0.87
2 3 2 14 235 0.07 -0.95 1.66 -0.99
2 3 3 10 338 0.07 -0.39 2.12 -0.99
2 3 4 11 328 0.07 -0.53 1.71 -0.98
2 3 5 13 141 0.13 -1.21 2.56 -0.54
2 3 6 13 166 0.07 -1.52 1.40 -1.00
2 3 7 10 399 0.08 -0.14 1.63 -0.93
2 3 8 5 260 0.05 -0.61 2.35 -0.90
2 3 9 7 217 0.05 -0.79 2.41 -0.84
2 3 10 10 389 0.1 -0.16 1.96 -1.00

Now let's check the compression summary for the HVT (torus_mapC). The table below shows the number of cells, the number of cells with quantization error below the threshold, and the percentage of such cells for each level.

displayTable(data = torus_mapC[[3]]$compression_summary, columnName = 'percentOfCellsBelowQuantizationErrorThreshold', value = 0.8, tableType = "compression")
segmentLevel noOfCells noOfCellsBelowQuantizationError percentOfCellsBelowQuantizationErrorThreshold parameters
1 30 0 0 n_cells: 30 quant.err: 0.1 distance_metric: L1_Norm error_metric: max quant_method: kmeans
2 898 739 0.82 n_cells: 30 quant.err: 0.1 distance_metric: L1_Norm error_metric: max quant_method: kmeans

As seen in the table above, 0% of the cells have a quantization error below the threshold at level 1, while 82% do at level 2.

Let’s plot the Voronoi tessellation for layer 2 (map C).

plotHVT(torus_mapC,
        line.width = c(0.2,0.1), 
        color.vec = c("navyblue","steelblue"),
        centroid.size = 0.1,
        maxDepth = 2, 
        plot.type = '2Dhvt')
Figure 11: The Voronoi Tessellation for layer 2 (map C) shown for the 928 cells in the dataset ’torus’ at level 2

6.3 Heatmaps

Now let’s plot all the features for each cell at level two as a heatmap for better visualization.

The heatmaps displayed below provide a visual representation of the spatial characteristics of the torus dataset, allowing us to observe patterns and trends in the distribution of each of the features (x, y, z). In each heatmap, the green shades highlight regions with higher values, while the indigo shades indicate areas with the lowest values. By analyzing these heatmaps, we can gain insight into the variations in, and relationships between, these features within the torus dataset.

plotHVT(torus_mapC,
        child.level = 2,
        hmap.cols = "x",
        line.width = c(0.2,0.1),
        color.vec = c("navyblue","steelblue"),
        centroid.size = 0.1,
        plot.type = '2Dheatmap')
Figure 12: The Voronoi Tessellation with the heat map overlaid for feature `x` in the ’torus’ dataset


plotHVT(torus_mapC,
        child.level = 2,
        hmap.cols = "y",
        line.width = c(0.2,0.1),
        color.vec = c("navyblue","steelblue"),
        centroid.size = 0.1,
        plot.type = '2Dheatmap')
Figure 13: The Voronoi Tessellation with the heat map overlaid for feature `y` in the ’torus’ dataset


plotHVT(torus_mapC,
        child.level = 2,
        hmap.cols = "z",
        line.width = c(0.2,0.1),
        color.vec = c("navyblue","steelblue"),
        centroid.size = 0.1,
        plot.type = '2Dheatmap')
Figure 14: The Voronoi Tessellation with the heat map overlaid for feature `z` in the ’torus’ dataset

We now have the set of maps (map A, map B & map C) that will be used for scoring, i.e., determining which map and cell each test record is assigned to.

7. Scoring

Now that we have built the model, let us score the testing dataset (containing 2400 data points) to determine which cell and layer each point belongs to.

The scoreLayeredHVT function is used to score the testing dataset using the trained set of maps. This function takes as input a testing dataset and the set of maps (map A, map B, map C).

Now, let us understand the scoreLayeredHVT function.

scoreLayeredHVT(data,
                hvt_mapA,
                hvt_mapB,
                hvt_mapC,
                child.level = 1,
                mad.threshold = 0.2,
                normalize = TRUE, 
                seed=300,
                distance_metric="L1_Norm",
                error_metric="max",
                yVar= NULL)

Each of the parameters of the scoreLayeredHVT function is explained below.

Before that, note that scoreLayeredHVT internally uses the scoreHVT function to score the test data against the results of trainHVT, which are referred to here as a ‘map’. scoreLayeredHVT therefore scores the test dataset against maps A, B & C and then processes and merges the final output, so the arguments passed through to scoreHVT are important for smooth execution of the function.

When normalize is set to TRUE, the scoreHVT function has an inbuilt feature that standardizes the testing dataset using the mean and standard deviation of the training dataset stored in the trainHVT results.
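
The standardization can be sketched as a z-score transform with the training statistics (an illustrative snippet with made-up data, not the actual scoreHVT implementation):

```r
# Standardize test data with the *training* mean and sd, as done when normalize = TRUE
train <- data.frame(x = c(1, 2, 3, 4), y = c(10, 20, 30, 40))
test  <- data.frame(x = 2.5, y = 25)

mu    <- sapply(train, mean)  # training means
sigma <- sapply(train, sd)    # training standard deviations

# Subtract the training mean, then divide by the training sd, column-wise
scaled_test <- sweep(sweep(test, 2, mu, "-"), 2, sigma, "/")
```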

The function scores based on the HVT maps (map A, map B and map C) constructed using the trainHVT function. Each test record is assigned to Layer 1 or Layer 2: Layer 1 contains the cell IDs from map A, while Layer 2 contains the cell IDs from map B (novelty map) and map C (map without novelty).

Scoring Algorithm

The Scoring algorithm recursively calculates the distance between each point in the testing dataset and the cell centroids for each level. The following steps explain the scoring method for a single point in the test dataset:

  1. Calculate the distance between the point and the centroid of all the cells in the first level.
  2. Find the cell whose centroid has minimum distance to the point.
  3. Check if the cell drills down further to form more cells.
  4. If it doesn’t, return the path; otherwise, repeat steps 1 to 3 at the next level until we reach a level at which the cell doesn’t drill down further.

Note: the scoring algorithm will not work if any of the variables used to perform quantization are missing, so no features should be removed from the testing dataset.
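
The steps above amount to a nearest-centroid search at each level. A minimal single-level sketch using the L1 norm (a simplified illustration, not the actual scoreLayeredHVT implementation, which recurses into child levels when a cell drills down):

```r
# Find the cell whose centroid is closest to a test point, using the L1 distance
nearest_cell <- function(point, centroids) {
  # centroids: one row per cell, columns matching the features of `point`
  d <- apply(centroids, 1, function(ctr) sum(abs(point - ctr)))  # L1 distances
  which.min(d)  # index of the closest cell
}

centroids <- rbind(c(0, 0, 0),
                   c(1, 1, 1),
                   c(-1, 2, 0))
nearest_cell(c(0.9, 1.2, 0.8), centroids)  # assigns the point to cell 2
```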

validation_data <- torus_test
new_score <- scoreLayeredHVT(
    data=validation_data,
    hvt_mapA = torus_mapA,
    hvt_mapB = torus_mapB,
    hvt_mapC = torus_mapC,
    normalize = FALSE
  )

Let’s see which cell and layer each point belongs to and check the Mean Absolute Difference for each of the 2400 records. For the sake of brevity, we are only displaying the first 100 rows.

act_pred <- new_score[["actual_predictedTable"]]
rownames(act_pred) <- NULL
act_pred %>% head(100) %>% as.data.frame() %>% Table(scroll = TRUE)
Row.Number act_x act_y act_z Layer1.Cell.ID Layer2.Cell.ID pred_x pred_y pred_z diff
1 -2.6282 0.5656 -0.7253 A426 C77 -2.5813521 -0.2999468 -0.6123004 0.3417981
2 2.7471 -0.9987 -0.3848 A383 C886 2.6772045 -0.7129922 -0.3361825 0.1347403
3 -2.4446 -1.6528 0.3097 A43 C122 -2.3163828 -1.1699452 0.6168016 0.3060579
4 -2.6487 -0.5745 0.7040 A137 C122 -2.3163828 -1.1699452 0.6168016 0.3383203
5 -0.2676 -1.0800 -0.4611 A280 C538 0.2727044 -1.2341352 -0.6402883 0.2912093
6 -1.1130 -0.6516 -0.7040 A302 C250 -1.3009074 0.1932009 -0.6693836 0.3557749
7 2.0288 1.9519 0.5790 A872 C728 1.7540533 1.2884739 0.8592087 0.4061272
8 -2.4799 1.6863 -0.0470 A706 C55 -2.3105632 1.4238165 -0.4643663 0.2830622
9 -0.4105 -1.1610 -0.6398 A254 C538 0.2727044 -1.2341352 -0.6402883 0.2522760
10 -0.2545 -1.6160 -0.9314 A177 C538 0.2727044 -1.2341352 -0.6402883 0.4000603
11 1.1500 0.3945 -0.6205 A551 C685 1.3422137 -0.3020255 -0.7171627 0.3284673
12 -1.2557 -1.1369 0.9520 A179 C386 -0.7375099 -1.5507936 0.8778894 0.3353981
13 -0.5449 -2.6892 -0.6684 A28 C563 0.0872637 -2.6105780 -0.5796456 0.2665134
14 2.9093 0.7222 -0.0697 A800 B7 2.9212455 0.6341636 0.0878182 0.0858333
15 2.3205 1.2520 -0.7711 A827 C819 2.4028203 0.8791671 -0.6244483 0.2006016
16 1.4772 -0.5194 -0.9008 A461 C685 1.3422137 -0.3020255 -0.7171627 0.1786660
17 -1.3176 -2.6541 0.2690 A3 C210 -1.4626517 -2.4376124 0.0398723 0.1968890
18 1.0687 0.1211 -0.3812 A513 C685 1.3422137 -0.3020255 -0.7171627 0.3442006
19 -0.9632 0.3283 -0.1866 A463 C250 -1.3009074 0.1932009 -0.6693836 0.3185300
20 2.5616 0.4634 0.7976 A761 C807 2.2367156 -0.0082026 0.8223858 0.2737576
21 2.8473 -0.9303 -0.0955 A389 C886 2.6772045 -0.7129922 -0.3361825 0.2093620
22 -0.5293 -0.8566 0.1173 A320 C406 -0.5681405 -0.8276460 -0.0344534 0.0731826
23 -1.9898 -2.1766 0.3150 A4 C210 -1.4626517 -2.4376124 0.0398723 0.3544295
24 -0.8845 -1.2219 -0.8709 A243 C244 -1.4496154 -1.3855974 -0.8882494 0.2487208
25 0.1553 2.2566 0.9651 A791 C443 0.1333753 1.8734750 0.8721690 0.1659936
26 2.4262 -0.6069 -0.8655 A459 C886 2.6772045 -0.7129922 -0.3361825 0.2954714
27 -0.0667 -1.4627 -0.8444 A225 C538 0.2727044 -1.2341352 -0.6402883 0.2573603
28 -0.0655 -1.3311 -0.7448 A268 C538 0.2727044 -1.2341352 -0.6402883 0.1798936
29 1.9592 1.5104 0.8806 A804 C728 1.7540533 1.2884739 0.8592087 0.1494880
30 1.2332 2.5452 0.5603 A865 C727 1.5048349 2.3899870 0.0467405 0.3134691
31 -0.8720 0.4903 0.0287 A483 C287 -0.9738620 0.8474068 0.6197989 0.3500226
32 0.2194 -1.7686 0.9760 A159 C628 0.4814534 -2.2922875 0.7817855 0.3266518
33 1.5052 0.0445 -0.8694 A532 C685 1.3422137 -0.3020255 -0.7171627 0.2205830
34 -2.8410 -0.8651 0.2439 A103 C122 -2.3163828 -1.1699452 0.6168016 0.4007880
35 1.3203 -2.5967 0.4077 A63 C628 0.4814534 -2.2922875 0.7817855 0.5057816
36 -1.5648 1.5577 0.9781 A650 C147 -1.4121276 2.1263891 0.6619017 0.3458532
37 0.3589 -1.0419 -0.4400 A340 C538 0.2727044 -1.2341352 -0.6402883 0.1595730
38 -0.2900 -2.0106 0.9995 A130 C386 -0.7375099 -1.5507936 0.8778894 0.3429757
39 0.5300 1.3668 0.8455 A698 C443 0.1333753 1.8734750 0.8721690 0.3099896
40 1.0254 -0.6738 0.6344 A409 C613 0.8381096 -0.8753467 0.5607136 0.1541745
41 -0.9306 0.3664 0.0154 A483 C287 -0.9738620 0.8474068 0.6197989 0.3762226
42 2.3888 -1.0670 0.7875 A411 C807 2.2367156 -0.0082026 0.8223858 0.4152558
43 -0.9830 -0.2043 -0.0897 A408 C406 -0.5681405 -0.8276460 -0.0344534 0.3644840
44 0.9499 0.3135 0.0261 A541 C584 0.9179814 0.5079842 0.2631802 0.1544943
45 -1.8079 -1.4936 0.9386 A127 C122 -2.3163828 -1.1699452 0.6168016 0.3846453
46 1.8399 -1.9295 -0.7459 A160 C777 1.4974529 -1.8159289 -0.8105544 0.1735575
47 -0.3304 -1.8481 0.9925 A125 C386 -0.7375099 -1.5507936 0.8778894 0.2730090
48 -2.2806 -1.8984 0.2536 A15 C122 -2.3163828 -1.1699452 0.6168016 0.3758131
49 -2.3323 1.7320 0.4252 A739 C55 -2.3105632 1.4238165 -0.4643663 0.4064955
50 0.5520 0.8441 0.1308 A593 C584 0.9179814 0.5079842 0.2631802 0.2781591
51 -0.9449 2.2273 0.9078 A755 C147 -1.4121276 2.1263891 0.6619017 0.2713456
52 0.2334 -1.4612 -0.8540 A214 C538 0.2727044 -1.2341352 -0.6402883 0.1600270
53 2.7387 0.9703 0.4244 A817 C819 2.4028203 0.8791671 -0.6244483 0.4919536
54 0.3561 1.1619 -0.6199 A645 C590 0.9634804 1.3193923 -0.8514567 0.3321432
55 1.7006 1.5569 -0.9522 A808 C590 0.9634804 1.3193923 -0.8514567 0.3584568
56 1.7244 -0.5698 0.9829 A467 C807 2.2367156 -0.0082026 0.8223858 0.4114757
57 0.9922 1.1438 -0.8741 A713 C590 0.9634804 1.3193923 -0.8514567 0.0756517
58 -0.3022 -1.3611 0.7956 A227 C386 -0.7375099 -1.5507936 0.8778894 0.2357643
59 -0.9693 1.0602 0.8261 A542 C287 -0.9738620 0.8474068 0.6197989 0.1412188
60 1.1313 -0.3595 -0.5824 A485 C685 1.3422137 -0.3020255 -0.7171627 0.1343836
61 -0.7561 -2.5384 -0.7611 A60 C563 0.0872637 -2.6105780 -0.5796456 0.3656654
62 2.3168 1.8924 0.1302 A892 C727 1.5048349 2.3899870 0.0467405 0.4643372
63 1.2363 -2.6444 -0.3939 A56 C563 0.0872637 -2.6105780 -0.5796456 0.4562013
64 -1.3204 -0.6281 0.8430 A260 C261 -1.3167277 -0.2491240 0.6686894 0.1856530
65 1.3733 1.1877 0.9829 A716 C728 1.7540533 1.2884739 0.8592087 0.2017395
66 1.0874 -0.1278 0.4251 A511 C584 0.9179814 0.5079842 0.2631802 0.3223742
67 2.1300 -1.2171 -0.8914 A335 C777 1.4974529 -1.8159289 -0.8105544 0.4374072
68 1.6863 -0.5945 0.9773 A467 C807 2.2367156 -0.0082026 0.8223858 0.4305424
69 0.8504 1.0927 -0.7882 A681 C590 0.9634804 1.3193923 -0.8514567 0.1343432
70 0.3029 1.0731 0.4656 A630 C430 -0.0971827 1.0248170 -0.1998627 0.3712761
71 -1.4724 1.1331 0.9899 A567 C287 -0.9738620 0.8474068 0.6197989 0.3847774
72 -0.5452 -1.2243 0.7514 A223 C386 -0.7375099 -1.5507936 0.8778894 0.2150976
73 -1.6866 2.1137 0.7101 A763 C147 -1.4121276 2.1263891 0.6619017 0.1117866
74 1.2012 -2.0386 -0.9305 A163 C777 1.4974529 -1.8159289 -0.8105544 0.2129565
75 -0.2108 2.3579 0.9301 A791 C443 0.1333753 1.8734750 0.8721690 0.2955104
76 -0.5982 1.3776 -0.8671 A656 C236 -0.9350442 1.7816903 -0.8837452 0.2525266
77 -0.2116 -1.0573 -0.3878 A303 C538 0.2727044 -1.2341352 -0.6402883 0.3045426
78 -0.7802 -0.9000 -0.5880 A275 C406 -0.5681405 -0.8276460 -0.0344534 0.2793200
79 1.0850 -1.6815 1.0000 A182 C839 1.9894376 -1.7503337 0.5523647 0.4736356
80 1.5563 0.1715 -0.9008 A617 C685 1.3422137 -0.3020255 -0.7171627 0.2904164
81 -0.3790 1.4273 0.8522 A652 C443 0.1333753 1.8734750 0.8721690 0.3261731
82 -1.2769 -0.2633 0.7178 A347 C261 -1.3167277 -0.2491240 0.6686894 0.0343714
83 -1.6039 2.4566 0.3575 A798 C147 -1.4121276 2.1263891 0.6619017 0.2754617
84 -0.9297 2.4281 -0.8000 A797 C236 -0.9350442 1.7816903 -0.8837452 0.2451664
85 0.5324 -0.8526 0.1016 A376 C613 0.8381096 -0.8753467 0.5607136 0.2625233
86 0.3928 1.5433 -0.9132 A722 C590 0.9634804 1.3193923 -0.8514567 0.2854438
87 1.0031 0.3850 -0.3786 A543 C584 0.9179814 0.5079842 0.2631802 0.2832943
88 -0.7562 0.7889 -0.4207 A536 C430 -0.0971827 1.0248170 -0.1998627 0.3719239
89 -1.0870 -0.7523 -0.7350 A302 C244 -1.4496154 -1.3855974 -0.8882494 0.3830541
90 -1.8671 -0.8423 -0.9988 A199 C244 -1.4496154 -1.3855974 -0.8882494 0.3571109
91 0.8325 -0.9413 0.6689 A351 C613 0.8381096 -0.8753467 0.5607136 0.0599164
92 -0.3355 0.9636 0.2005 A574 C430 -0.0971827 1.0248170 -0.1998627 0.2332990
93 -1.0089 -0.6007 0.5639 A296 C261 -1.3167277 -0.2491240 0.6686894 0.2547310
94 1.7725 1.7153 -0.8845 A833 C590 0.9634804 1.3193923 -0.8514567 0.4126568
95 0.5539 -0.8888 0.3037 A360 C613 0.8381096 -0.8753467 0.5607136 0.1848922
96 0.8149 -2.6016 0.6874 A73 C628 0.4814534 -2.2922875 0.7817855 0.2457149
97 0.1104 1.7654 -0.9729 A757 C236 -0.9350442 1.7816903 -0.8837452 0.3836298
98 1.0107 0.3118 0.3349 A537 C584 0.9179814 0.5079842 0.2631802 0.1202075
99 2.2697 -0.3642 0.9543 A473 C807 2.2367156 -0.0082026 0.8223858 0.1736320
100 0.4983 -0.8672 -0.0185 A376 C613 0.8381096 -0.8753467 0.5607136 0.3090567
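
The diff column is the mean absolute difference between the actual and predicted coordinates of a record. Verifying this for row 82 of the table above (an illustrative check with values copied from that row):

```r
# Mean absolute difference for row 82 of the scored table above
act  <- c(-1.2769, -0.2633, 0.7178)            # act_x, act_y, act_z
pred <- c(-1.3167277, -0.2491240, 0.6686894)   # pred_x, pred_y, pred_z
mad_diff <- mean(abs(act - pred))
round(mad_diff, 7)  # 0.0343714, matching the diff column
```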
hist(act_pred$diff, breaks = 30, col = "blue", main = "Mean Absolute Difference", xlab = "Difference")
Figure 16: Mean Absolute Difference

8. Executive Summary

9. References

  1. Topology Preserving Maps : https://users.ics.aalto.fi/jhollmen/dippa/node9.html

  2. Vector Quantization : https://en.wikipedia.org/wiki/Vector_quantization

  3. K-means : https://en.wikipedia.org/wiki/K-means_clustering

  4. Sammon’s Projection : https://en.wikipedia.org/wiki/Sammon_mapping

  5. Voronoi Tessellations : https://en.wikipedia.org/wiki/Centroidal_Voronoi_tessellation